Ground truth bias in external cluster validity indices

Authors

  • Yang Lei
  • James C. Bezdek
  • Simone Romano
  • Nguyen Xuan Vinh
  • Jeffrey Chan
  • James Bailey
Abstract

External cluster validity indices (CVIs) are used to quantify the quality of a clustering by measuring the similarity between the clustering and a ground truth partition. However, some external CVIs show a biased behaviour when selecting the most similar clustering. Users may consequently be misguided by such results. Recognizing and understanding the bias behaviour of CVIs is therefore crucial. It has been noticed that some external CVIs exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand Index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard Index (JI) suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature. In this work, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, NCinc bias in the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behaviour, since it may affect the interpretations of CVI results. The objective of this article is to study the empirical and theoretical implications of GT bias. To the best of our knowledge, this is the first extensive study of such a property for external cluster validity indices. Our computational experiments show that 5 of 26 indices studied in this paper exhibit GT bias. Following the numerical examples, we provide a theoretical analysis of GT bias based on the relationship between the RI and quadratic entropy. Specifically, we prove that the quadratic entropy of the ground truth partition provides a computable test which predicts the NC bias status of the Rand Index.

Preprint submitted to Elsevier, June 20, 2016; arXiv:1606.05596v1 [stat.ML], 17 Jun 2016.
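The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the standard pair-counting definitions of the Rand Index and Jaccard Index, and it assumes the quadratic entropy of a partition is `1 - sum(p_i ** 2)` over cluster proportions `p_i` (the paper may use a different normalisation). The function names `pair_counts`, `rand_index`, `jaccard_index` and `quadratic_entropy` are hypothetical, chosen for illustration. The entropy threshold that predicts the NC bias status is stated in the full paper and is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def pair_counts(u, v):
    """Count object pairs by co-membership in two label vectors u and v.

    Returns (a, b, c, d): pairs together in both partitions, together only
    in u, together only in v, and separated in both.
    """
    if len(u) != len(v):
        raise ValueError("label vectors must have the same length")
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u = u[i] == u[j]
        same_v = v[i] == v[j]
        if same_u and same_v:
            a += 1
        elif same_u:
            b += 1
        elif same_v:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(u, v):
    """Rand Index: fraction of pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(u, v)
    return (a + d) / (a + b + c + d)


def jaccard_index(u, v):
    """Jaccard Index: agreement on 'together' pairs, ignoring d."""
    a, b, c, _ = pair_counts(u, v)
    # Assumption: return 1.0 when no pair is together in either partition.
    return a / (a + b + c) if (a + b + c) else 1.0


def quadratic_entropy(labels):
    """Assumed definition: 1 - sum of squared cluster proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


if __name__ == "__main__":
    balanced_gt = [0, 0, 1, 1]   # evenly sized ground truth clusters
    skewed_gt = [0, 0, 0, 1]     # one dominant ground truth cluster
    candidate = [0, 1, 0, 1]     # a candidate partition to score
    for name, gt in [("balanced", balanced_gt), ("skewed", skewed_gt)]:
        print(name,
              "RI =", rand_index(gt, candidate),
              "JI =", jaccard_index(gt, candidate),
              "H2 =", quadratic_entropy(gt))
```

Skewing the ground truth lowers its quadratic entropy under this definition, which is the quantity the abstract ties to the bias status of the Rand Index.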

Similar resources

Measuring Player's Behaviour Change over Time in Public Goods Game

An important issue in the public goods game is whether players' behaviour changes over time, and if so, how significant the change is. In this game, players can be classified into different groups according to the level of their participation in the public good. This problem can be considered as a concept drift problem by asking the amount of change that happens to the clusters of players over a sequence of...

Comprehensive Cross-Hierarchy Cluster Agreement Evaluation

Hierarchical clustering represents a family of widely used clustering approaches that can organize objects into a hierarchy based on the similarity in objects’ feature values. One significant obstacle facing hierarchical clustering research today is the lack of general and robust evaluation methods. Existing works rely on a range of evaluation techniques including both internal (no ground-truth...

Development of An External Cluster Validity Index using Probabilistic Approach and Min-max Distance

Validating a given clustering result is a very challenging task in the real world. For this purpose, several cluster validity indices have been developed in the literature. Cluster validity indices are divided into two main categories: external and internal. External cluster validity indices rely on some supervised information being available, and internal validity indices utilize the intrinsic structu...

Clustering algorithms used in 3D scene segmentation

In this paper, we implement and compare three different clustering algorithms for the purpose of 3D image segmentation. Specifically, the K-means, Mean Shift, and Hierarchical methods are studied, and their performance is compared using cluster validity methods. Performance was analyzed in two ways, first by comparing independent results from each, and second, by comparing results where Hierarc...

Unsupervised Feature Selection by Means of External Validity Indices

Feature selection for unsupervised data is a difficult task because a reference partition is not available to evaluate the relevance of the features. Recently, different proposals of methods for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the...

Journal title:
  • Pattern Recognition

Volume 65  Issue 

Pages  -

Publication date 2017